ABOUT THIS BOOK


This book provides a general understanding of this component's
processing. It contains information that can be useful when diagnosing a
problem suspected to have been caused by the component.


TRADEMARKS


The following is a trademark of International Business Machines
Corporation:


• MVS/ESA(TM)


WHO THIS BOOK IS FOR 


This book is for anyone who wants to gain a general understanding of the 
component's processing, or who needs to diagnose a problem that appears 
to be caused by the component. 


The level of detail at which this book is written assumes that the 
reader: 


• Understands the commonly used system diagnostic tasks and aids, such
  as those presented in the Basics of Problem Determination book

• Codes in assembler language and reads assembler and linkage editor
  output

• Understands basic system concepts and the use of system services

• Understands the externals of the component


HOW THIS BOOK IS ORGANIZED 


LY28-1432-0 


This book contains the following chapter headings. However, it might not
include information for a chapter because the information is not
applicable. If the information for a chapter is not applicable, the
chapter heading will contain the words "Not Applicable".

• Part 1. Diagnostic Procedures for the component

  - Chapter 1. Diagnosis for this component suggests how to diagnose
    problems in the component.

• Part 2. Reference for the component

  - Chapter 2. Introduction to this component gives the functions
    performed by the component and the expected inputs and outputs.

  - Chapter 3. Control Block Overview shows the significant fields
    and the chaining structure of the component's control blocks.

  - Chapter 4. Process Flow shows the control flow among the
    component's modules.

  - Chapter 5. Method of Operation describes the functional
    organization of the component.

  - "Index".


Short Title                      Title                                 Order Number

Basics of Problem Determination  MVS/ESA(TM) Basics of Problem         GC28-1839
                                 Determination

IPCS Command Reference           MVS/ESA Interactive Problem Control   GC28-1834
                                 System (IPCS) Command Reference

IPCS User's Guide                MVS/ESA Interactive Problem Control   GC28-1813
                                 System (IPCS) User's Guide

Service Aids                     MVS/ESA Service Aids                  GC28-1844


SUMMARY OF CHANGES


Summary of Changes 

for LY28-1432-0 

as updated December 22, 1989 

by Technical Newsletter LN28-1334


CHANGED INFORMATION: The Diagnosis Library figure has been updated. 


Minor editorial changes have been made in the front matter. 


Summary of Changes 
for LY28-1432-0 
MVS/System Product Version 3 Release 1.0 


This book contains information previously presented in MVS/XA System
Logic Library: Alternate CPU Recovery, LY28-1617. The following
summarizes the changes to that information.


New Information


"Part 1. Diagnostic Procedures for ACR" is new and contains the
following:

• "Chapter 1. Diagnosis for ACR"

"Part 2. Reference for ACR" is new and contains the following:

• "Chapter 2. Introduction"

• "Chapter 3. Control Block Overview"

• "Chapter 4. Process Flow"

• "Chapter 5. Method of Operation"


Changed Information 


ACR processing has been updated in support of this release. In
addition, the HIPO "Alternate CPU Recovery Overview" has been updated to
include control register identification.


CHAPTER 1. DIAGNOSIS FOR ACR


Prerequisites 


If you are diagnosing a suspected problem in this component, first 
collect the following problem data. If you do not have this data, 
refer to the Basics of Problem Determination book and perform its 
procedures. 


• The problem type, such as an ABEND or wait state.

• The product name.

• The component name: ACR

  Use this book to diagnose problems only in component ACR. If
  the component name is not ACR, return to the Basics of Problem
  Determination book to identify the component. Use the component
  diagnosis book for the identified component.

• The module name.

• The symptoms used as the initial search argument.

• A dump, if needed for the problem. Format the dump for
  component ACR, as described in Basics of Problem Determination.

• The system execution status, if needed for the problem.


The alternate CPU recovery component does not provide component-specific 
symptom data. If you report the problem to IBM, see the Basics of 
Problem Determination book for a list of problem data to collect. The 
remainder of this book might help you diagnose the problem. 


PROCESSING DIAGNOSIS FOR ACR 


Alternate CPU recovery (ACR), running on a properly functioning
processor, initiates the release of global resources that are held by
programs that were running on the failed processor. ACR does this by
changing control block pointers so that the operating system behaves as
if it were running on the failed processor, and can therefore run the
functional recovery routines (FRRs) for that processor. This results in
the release of the locks and other global resources held on the failed
processor.


ACR consists of three processing phases: pre-processing, intermediate, 
and post-processing. 


ACR pre-processing sets the LCCARCPU and LCCADCPU fields in the LCCAs of 
both processors with the addresses of the LCCAs of the recovery and 
failed processors respectively. 


The FRR processing for the failed processor could require a resource 
(such as the SALLOC lock to free storage) held by the recovery 
processor. In this case, the FRR also invokes ACR's intermediate phase. 
This entry results in the suspension of the FRR processing for the 
failed processor, and another processor-switch back to the recovery 
processor to resume its processing (to allow the resource needed by the 
failed processor's FRR to be freed).


ACR's post-processing consists of issuing a SYSEVENT to inform SRM about 
the failed processor, calling the service processor to take the failed 
processor physically offline, issuing the ACR-complete message, and 
resetting the flags to indicate that ACR is complete.


PART 2. REFERENCE FOR ACR


CHAPTER 2. INTRODUCTION TO ALTERNATE CPU RECOVERY


Alternate CPU recovery (ACR) is the process by which the operating
system dynamically adjusts to the unexpected failure of a processor in a
multiprocessing configuration. ACR saves as much work from the failed 
processor as possible, and terminates work in progress as an abnormal 
termination condition. This allows ACR to attempt software recovery 
through the use of recovery and retry routines defined in the system at 
the time of the malfunction in the failed processor. Any tasks with 
processor affinity to the failed processor are terminated when they are 
redispatched. 


OPERATING ENVIRONMENT 


When the system is running in an MP environment, it must prevent
programs running on different processors from simultaneously assigning
resources, such as storage, or updating critical data fields, such as 
ASM control blocks. The operating system does this by assigning a lock 
to each resource or data field, and by establishing a protocol whereby a 
program must obtain the appropriate lock before it acquires the related 
resource or updates the related data area. Since a lock can be held by 
only one program at a time, it prevents the simultaneous use of its 
resource by programs on different processors. 


When a program on one processor needs to use a resource but cannot
obtain the related lock, it generally waits and keeps trying to obtain
the lock. The locking protocol requires that a program hold a lock for
as short a time as possible and that the program holding a lock be
disabled (except for machine checks). This generally results in little
or no waiting for a lock when the system is running normally.


However, various hardware and software malfunctions can cause a
processor to stop processing while the program running on that processor
holds one or more locks. Since it is inevitable that some program
running on the other processor will need one of these locks, the system
will "hang" unless a way can be found to identify and free (on the
running processor) the locks held on the failed processor.
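

The hang described above can be illustrated with a small single-threaded
simulation. This is a minimal sketch, not MVS code: the Lock class, the
try_obtain and release methods, and the bounded spin are hypothetical
names introduced only for illustration. A real spin loop would not give
up after a fixed number of attempts; the limit exists only so that the
example terminates.

```python
# Illustrative simulation only; Lock, try_obtain, release, and spin are
# hypothetical names, not MVS services.

class Lock:
    """A lock that can be held by at most one processor at a time."""

    def __init__(self, name):
        self.name = name
        self.holder = None  # processor id of the current holder, or None

    def try_obtain(self, cpu):
        """Obtain the lock for `cpu` if it is free; report success."""
        if self.holder is None:
            self.holder = cpu
            return True
        return False

    def release(self, cpu):
        """Release the lock; only the recorded holder may release it."""
        if self.holder != cpu:
            raise RuntimeError(f"CPU {cpu} does not hold {self.name}")
        self.holder = None


def spin(lock, cpu, max_tries):
    """Keep trying to obtain `lock`; give up after `max_tries` attempts."""
    for _ in range(max_tries):
        if lock.try_obtain(cpu):
            return True
    return False


salloc = Lock("SALLOC")
salloc.try_obtain(1)        # a program on CPU 1 obtains the lock
# CPU 1 now stops; nothing running on CPU 1 will ever release the lock.
hung = not spin(salloc, 0, max_tries=1000)   # CPU 0 spins without success

# Recovery, acting on behalf of the failed CPU 1, frees the lock.
salloc.release(1)
recovered = spin(salloc, 0, max_tries=1000)
print(hung, recovered)
```

In this toy model, the release performed "as" CPU 1 stands in for the
recovery work that ACR arranges on the running processor.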


ACR PROCESSING


ACR consists of three phases: pre-processing, intermediate, and
post-processing. See Figure 1 on page ACR-13 for an overview of ACR
processing.


PRE-PROCESSING PHASE


When one of the processors in the system has a malfunction and is about
to stop processing, it issues either a malfunction alert (MFA) for a
hardware failure or an emergency signal (EMS) for a software failure.


The MFA or EMS interruption handler then:

• invokes ACR's pre-processing phase on the recovery (healthy)
  processor to mark the failed processor offline

• saves the failing CPU's system mask

• saves the failing CPU's work unit status in its ACR work/save area

• sets up control block pointers for later recovery processing

• sets the ACR flags.
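

The steps listed above can be sketched as a toy model. Only the field
names LCCARCPU and LCCADCPU come from this book; the Lcca class, its
other attributes, and the acr_preprocess function are hypothetical and
are not the layout or logic of the real control blocks.

```python
# Hypothetical model of the pre-processing steps; only the field names
# LCCARCPU and LCCADCPU come from the text, everything else is invented.

class Lcca:
    """Toy stand-in for one processor's LCCA."""

    def __init__(self, cpu_id, system_mask, work_unit):
        self.cpu_id = cpu_id
        self.online = True
        self.system_mask = system_mask
        self.work_unit = work_unit      # status of the work unit in progress
        self.lccarcpu = None            # LCCA of the recovery processor
        self.lccadcpu = None            # LCCA of the failed processor
        self.acr_in_progress = False    # stand-in for the ACR flags
        self.acr_save_area = {}         # stand-in for the ACR work/save area


def acr_preprocess(recovery, failed):
    """Apply the pre-processing steps to the two toy LCCAs."""
    failed.online = False                                  # mark offline
    failed.acr_save_area["system_mask"] = failed.system_mask
    failed.acr_save_area["work_unit"] = failed.work_unit
    for lcca in (recovery, failed):                        # both LCCAs
        lcca.lccarcpu = recovery
        lcca.lccadcpu = failed
        lcca.acr_in_progress = True


cpu0 = Lcca(0, system_mask=0x07, work_unit="task A")
cpu1 = Lcca(1, system_mask=0x04, work_unit="task B")
acr_preprocess(recovery=cpu0, failed=cpu1)
print(cpu1.online, cpu0.lccadcpu.cpu_id, cpu1.lccarcpu.cpu_id)
```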


When this processing is complete, control returns to the dispatcher,
which applies its normal algorithms to decide what processing should now
be resumed on the recovery processor. The failed processor is logically
offline and assumed to be not operating. The recovery processor and the
failed processor are referred to as the "ACR-pair."


INTERMEDIATE PHASE 


The dispatcher or a system routine that cannot obtain a global resource
initiates ACR's intermediate phase. The system routine determines that
the required resource is held by a processor other than the one on which
the routine is running, and that ACR is in progress.


ACR's intermediate phase (running on the recovery processor) switches
some control block pointers and data fields to make the system think
that it is running on the failed processor. ACR saves the current
processor's work unit status in its ACR work/save area and then copies
the other processor's work unit status from its ACR work/save area to
the current processor. ACR then simulates a machine check to get MCH
to perform hardware recording. ACR exits from this phase, still
simulating the failed processor, to RTM, which initiates execution of
the failed processor's FRRs. This frees all system resources held on the
failed processor and allows the routine that initiated the intermediate
phase to continue.
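

The save-and-copy step of a processor-switch can be sketched as follows.
This is an illustration under stated assumptions, not the IEAVTACR
logic: the AcrPair class, its attribute names, and the dictionary used
as work unit status are all hypothetical.

```python
# Hypothetical model of a processor-switch; all names are illustrative.

class AcrPair:
    """Tracks which processor of the pair is currently being simulated."""

    def __init__(self, recovery_status, failed_status):
        self.simulating = "recovery"
        self.current = recovery_status          # live work unit status
        # Pre-processing has already saved the failed processor's status.
        self.save_area = {"recovery": None, "failed": failed_status}

    def processor_switch(self):
        """Save the current status, then load the other processor's."""
        other = "failed" if self.simulating == "recovery" else "recovery"
        self.save_area[self.simulating] = self.current
        self.current = self.save_area[other]
        self.simulating = other


pair = AcrPair(recovery_status={"psw": "R"}, failed_status={"psw": "F"})
pair.processor_switch()      # now running as if on the failed processor
first = (pair.simulating, pair.current["psw"])
pair.processor_switch()      # switch back to the recovery processor
second = (pair.simulating, pair.current["psw"])
print(first, second)
```

Because each switch saves before it loads, repeated switches between the
two processors lose no status in this model.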


This "ping-ponging" between the two processors continues until both no
longer hold any global resources and are ready to run enabled. At this 
point, the dispatcher initiates the call to ACR's post-processing phase. 


POST-PROCESSING PHASE 


The dispatcher initiates the ACR post-processing phase to complete ACR 
processing. ACR is not complete until the dispatcher calls ACR twice, 
once for the recovery processor, and once for the failed processor. 
Each entry indicates that the processing on the related processor is 
enabled. When the call is from the dispatcher, IEAVTACR determines if
the other processor (the one not presently being simulated) is ready to
run enabled. If it is, ACR starts its post-processing phase. If the
other processor is not ready to run enabled, ACR executes another 
processor-switch to simulate the other processor and continue with its 
processing. When this processor is ready to be enabled, the dispatcher 
is entered and calls ACR for the second time. This call results in the 
execution of the post-processing phase. 
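

The two-call rule described above can be sketched as a small decision
function. The function name, the state dictionary, and its keys are
assumptions made for this illustration; they do not correspond to real
control block fields.

```python
# Hypothetical decision logic; the function and key names are illustrative.

def dispatcher_call(state):
    """Handle one dispatcher call to ACR and report the action taken.

    `state` records which processor is being simulated and which of the
    two processors are ready to run enabled.
    """
    current = state["simulating"]
    other = "failed" if current == "recovery" else "recovery"
    state["ready"][current] = True      # this entry: current is enabled
    if state["ready"][other]:
        return "post-processing"
    state["simulating"] = other         # processor-switch to the other one
    return "processor-switch"


state = {"simulating": "recovery",
         "ready": {"recovery": False, "failed": False}}
actions = [dispatcher_call(state), dispatcher_call(state)]
print(actions)
```

In this sketch the first call finds the other processor not yet ready
and switches to it; only the second call reaches post-processing.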


If the failed processor did not hold any resources when it failed, the
recovery processor never would have needed to initiate ACR's
intermediate phase. In this case, the termination of ACR is initiated
when the recovery processor next enters the dispatcher. The dispatcher
still calls ACR, and this call results in a processor-switch to the
failed processor, the call to RTM for hardware processing, and a
subsequent exit from ACR to RTM (still simulating the failed processor).
Control eventually gets back to the dispatcher, which makes the second
call to the ACR post-processing phase. This second call results in a
switch back to the recovery processor and execution of the
post-processing phase.


ADDRESSING AND RESIDENCY MODES


Addressing mode determines whether a fullword address is treated as
a 24-bit address or a 31-bit address. Addressing mode is determined
by bit 32 of the PSW. For AMODE 24, a module must reside below
the 16-megabyte line and can only reference data within this area of
virtual storage. For AMODE 31, a module can reside anywhere and
reference data anywhere in virtual storage.
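

The effect of the addressing mode bit can be shown with a little
arithmetic. The sketch below treats the PSW as a 64-bit integer whose
bit 0 is the leftmost bit; the function names and the sample PSW values
are made up for illustration.

```python
# Bit numbering follows the PSW convention: bit 0 is the leftmost bit of
# the 64-bit PSW, so bit 32 is the leftmost bit of the second word.

def amode(psw):
    """Return 31 if PSW bit 32 is one, otherwise 24."""
    return 31 if (psw >> 31) & 1 else 24


def effective_address(psw, fullword):
    """Interpret a 32-bit fullword as an address under the PSW's AMODE."""
    mask = 0x7FFFFFFF if amode(psw) == 31 else 0x00FFFFFF
    return fullword & mask


psw_amode31 = 0x070C000080001000   # made-up PSW with bit 32 set
psw_amode24 = 0x070C000000001000   # same PSW with bit 32 clear
fullword = 0x81234567
print(amode(psw_amode31), hex(effective_address(psw_amode31, fullword)))
print(amode(psw_amode24), hex(effective_address(psw_amode24, fullword)))
```

The same fullword yields a different address in each mode because AMODE
24 ignores the high-order byte, while AMODE 31 ignores only the
high-order bit.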


Residency mode is the location in virtual storage where a module 
resides. A module's residency mode can be either 24 (its address is 
less than 16 megabytes) or ANY (it can reside anywhere in virtual 
storage). 


All alternate CPU recovery (ACR) modules have an addressing mode (AMODE)
of 31 and a residency mode (RMODE) of ANY.


Addressing and residency modes are explained in detail in the 31-Bit
Addressing publication.


CHAPTER 3. CONTROL BLOCK OVERVIEW (NOT APPLICABLE)


There is no control block information for this component. 


CHAPTER 4. PROCESS FLOW


Figure 1. ACR Processing Overview


CHAPTER 5. METHOD OF OPERATION


This section contains method of operation diagrams in the HIPO format.
HIPO diagrams are arranged in an input-processing-output format. The
left side of the diagram contains data that serves as input to the
processing steps in the center of the diagram, and the right side
contains the data that is output from the processing steps. Each
processing step is numbered; the number corresponds to an amplified
explanation of the step in the "Extended Description" box. The object
module name and label in the extended description point to the code that
performs the function.


The following figure shows the symbols used in HIPO diagrams. The
relative size and the order of fields in control block illustrations do
not always represent the actual size and format of the control block.


Key to Symbols Used in Method-of-Operation Diagrams


Primary processing - indicates major functional flow.

Secondary processing - indicates functional flow within a diagram.

Data movement, modification, or use.

Data reference - indicates the testing or reading of a data area to
determine the course of subsequent processing.

Pointer - indicates that a data area contains the address of another
data area.

Indirect pointer - indicates intermediate pointers have been omitted.

Connector - indicates that a diagram is continued on the next page.


Figure 2. Key to HIPO Diagrams


Note: Brief alternate CPU recovery module descriptions appear in
Component Diagnosis: Module Descriptions, which contains module
descriptions for all the system components described in the Component
Diagnosis and Logic books.


Alternate CPU Recovery (ACR) Overview (Part 1 of 4)


Alternate CPU Recovery (ACR) Overview (Part 2 of 4)


Extended Description


Alternate CPU recovery (ACR) recovers the system on
another processor when one processor in a multiprocessing
environment fails.


1  IEAVTACR sets up addressability to the ACR work
   areas, saves the caller's registers and return address,
   and determines whether this is a call for the ACR
   pre-processing phase.


2  When the call is for ACR's pre-processing phase,
   IEAVTACR marks the failed processor and associated
   Vector Facility offline to MVS, creates a trace entry to
   identify the failed processor and that ACR has begun,
   initializes the ACR fields in the LCCAs of both the recovery
   and failed processors, and saves data from the failed
   processor's PSA in the failed processor's ACR work area. ACR
   then returns to the EMS or MFA interruption handler
   (step 6).


3  When the call is for ACR's intermediate processing,
   IEAVTACR sets the LCCA suspend bit for the processor
   that ACR is presently simulating. When the call is
   the first call for the intermediate phase, ACR switches
   control block data so that the recovery processor is
   simulating the failed processor. This "processor-switch"
   consists of saving vector status and saving some of the
   recovery processor's PSA fields and control registers in
   the recovery processor's ACR save area, and replacing
   them with the corresponding data for the failing processor.
   This data includes hardware data and the base pointers to
   processor-related control blocks such as the LCCA, PCCA,
   and FRR stacks. This means that although processing is
   always on the recovery processor (because the other one
   has failed), processor-switching can make the system
   recovery routines think they are running on either the
   recovery or the failed processor.

   After ACR has switched to simulate the failing processor,
   ACR simulates a machine check to get control to RTM to
   perform hardware recording and recovery processing for
   the failed processor. On return from RTM, ACR goes to
   step 6 to exit to RTM. The processor-switch will cause RTM
   to think that it is running on the failed processor, and to
   initiate execution of the FRRs of that processor.


Module: IEAVTACR, IEAVVMCH

Label: ACRSTART, ACRPREP, ACRPROSW, ACCRTM


Alternate CPU Recovery (ACR) Overview (Part 3 of 4)


phase, IEAVTACR could be simulating either the
recovery or the failed processor. If the LCCA suspend bit
of the processor that ACR is not simulating is set, IEAVTACR
does a processor-switch to simulate that processor, clears its
suspend bit, and goes to step 6 to exit. Since ACR has just
done a processor-switch, this exit is not to the caller that
initiated this pass through ACR. It is to the caller of ACR
on the processor that ACR just switched to.


5  When ACR is called and neither processor is suspended,
   IEAVTACR executes its post-processing phase. ACR
   first ensures that it is simulating the recovery processor.
   ACR calls the IOS shared up routine (IOSVSHUP) to set the
   UCB for possible reserve/release processing, and calls the IOS
   machine-check exit routine (IOSRMCH) to process any I/O
   machine checks. ACR then issues a SYSEVENT to inform
   SRM that a processor has gone offline, calls the service
   processor (IEAVMSF1) to take the failed processor physically
   offline, issues the ACR-complete message to the
   operator, and resets the ACR flags.


6  IEAVTACR exit processing consists of testing the ACR
   exit flags in the LCCA and exiting to the indicated
   system routine. The routines to which ACR may
   exit are the Lock Manager or a Spin Loop routine (when
   LCCACRLE='1'B), the Dispatcher (LCCACRRM='1'B),
   the External (MFA or EMS) FLIH (LCCACREF='1'B),
   and RTM (LCCACRRT='1'B).
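

The exit test in step 6 amounts to a flag-to-routine lookup, sketched
below. The flag names and routine descriptions come from step 6; the
dictionary used for the LCCA flags, the function name, and the rule of
taking the first flag found set are assumptions made for this sketch.

```python
# Flag names are taken from the text; the dictionary-based LCCA and the
# first-match rule are assumptions made for this sketch.

EXIT_ROUTINES = [
    ("LCCACRLE", "Lock Manager or Spin Loop routine"),
    ("LCCACRRM", "Dispatcher"),
    ("LCCACREF", "External (MFA or EMS) FLIH"),
    ("LCCACRRT", "RTM"),
]


def acr_exit_target(lcca_flags):
    """Return the routine named by the first exit flag that is set."""
    for flag, routine in EXIT_ROUTINES:
        if lcca_flags.get(flag):
            return routine
    raise ValueError("no ACR exit flag is set")


print(acr_exit_target({"LCCACRRT": True}))
print(acr_exit_target({"LCCACRLE": False, "LCCACREF": True}))
```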